-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: issue#915, Error for large integers in Series #1233
base: develop
Are you sure you want to change the base?
Conversation
@alexbarros you might want to look into this as well if you get the chance :) |
bins = np.histogram_bin_edges(finite_values, bins=bins_arg) | ||
stats[name] = np.histogram(finite_values, bins=bins, weights=weights) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
interesting solution, but still seems to behave a bit weirdly with big numbers
> import numpy as np
> arr = np.array([716277643516076032 + i for i in range(100)])
> bins = np.histogram_bin_edges(arr, bins=5)
> np.histogram(arr, bins=bins)
(array([ 0, 0, 65, 0, 35]),
array([7.16277644e+17, 7.16277644e+17, 7.16277644e+17, 7.16277644e+17,
7.16277644e+17, 7.16277644e+17]))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True, it is still not evenly distributed like it should be for smaller numbers. What do you propose here? Leaving np.histogram_bin_edges
raises an error for larger numbers. Is it better to raise error than have weird behavior?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking from a user's perspective, for me is better to have an error being raised than an incorrect plot. If I know that there was a problem with the large integers I can preprocess that column and run again, but an incorrect result may lead me to an incorrect interpretation of my data distribution.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that is what I was thinking as well. I think I should make the changes so that it leads to raising an error rather than making an incorrect plot, right?
Also there is a check Codacy Static Code Analysis
that is failing. I think that is a new one
Codecov ReportBase: 90.47% // Head: 90.49% // Increases project coverage by
📣 This organization is not using Codecov’s GitHub App Integration. We recommend you install it so Codecov can continue to function properly for your repositories. Learn more Additional details and impacted files@@ Coverage Diff @@
## develop #1233 +/- ##
===========================================
+ Coverage 90.47% 90.49% +0.01%
===========================================
Files 185 186 +1
Lines 5661 5681 +20
===========================================
+ Hits 5122 5141 +19
- Misses 539 540 +1
Flags with carried forward coverage won't be shown. Click here to find out more.
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. ☔ View full report at Codecov. |
6ba2217
to
ef023c3
Compare
4500563
to
cfb020d
Compare
This patch fixes issue #915: Error for series with large integers.
The issue is caused when using the
numpy.histogram
function with large integers which causes unevenly spaced bin edges.here:
https://github.com/ydataai/pandas-profiling/blob/5b1abac48ed9ed5a9e7e662be30c913acc3e7a5b/src/pandas_profiling/model/summary_algorithms.py#L39
and here:
https://github.com/ydataai/pandas-profiling/blob/5b1abac48ed9ed5a9e7e662be30c913acc3e7a5b/src/pandas_profiling/model/summary_algorithms.py#L52
This can cause the resulting histogram to be distorted or misleading, as the bin sizes may not be uniform.
To resolve this issue, I used the
numpy.histogram_bin_edges
function to compute the bin edges for the data before passing them to thenumpy.histogram
function. This function allows to specify the number of bins and the range of the data, and will compute the bin edges in a way that ensures they are evenly spaced. This fix does not raise an error as reported in the bug report and successfully generates a report. I have also included atest_issue915.py
for testing the generation of the report.